This repository contains exploratory data analysis of the Perfume Dataset using R.
The global fragrance industry is both highly competitive and deeply shaped by cultural and consumer preferences. Beyond aesthetics, the market reflects evolving trends in gender identity, lifestyle choices, and purchasing behaviors. For brands, understanding these dynamics is critical in designing product portfolios, targeting marketing campaigns, and identifying opportunities for innovation.
In this report, we analyze a curated dataset of perfumes covering multiple dimensions, including brand, type, category, target audience, and longevity. Our objective is to uncover patterns that reveal how different factors interact and shape consumer preferences.
readr, dplyr, tidyr, stringr, janitor, ggplot2, ggrepel, scales, plotly, tibble, vcd, caret, randomForest.
# Define a function that installs a package if it is missing, then loads it
install_if_missing <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, dependencies = TRUE, repos = "https://cloud.r-project.org/")
    library(pkg, character.only = TRUE)
  }
}
# Install and load the libraries
pkgs <- c("readr", "dplyr", "tidyr", "stringr", "janitor", "ggplot2", "ggrepel",
          "scales", "plotly", "tibble", "vcd", "caret", "randomForest")
invisible(lapply(pkgs, install_if_missing))
data/ → raw dataset and cleaned dataset
scripts/ → R scripts for data cleaning, analysis, visualization, modeling
notebooks/ → R Markdown for step-by-step analysis
outputs/ → figures, reports
docs/ → research notes, methodology
# Read in the CSV file
perfume <- read.csv("../../Data/Perfumes_dataset.csv")
# Standardize column names, whitespace, and letter case
perfume <- perfume |>
  janitor::clean_names() |>
  dplyr::mutate(
    brand = stringr::str_squish(brand),
    perfume = stringr::str_squish(perfume),
    type = stringr::str_squish(stringr::str_to_lower(type)), # e.g. "edp", "edt"
    category = stringr::str_squish(stringr::str_to_title(category)), # "Fresh Scent" etc.
    target_audience = stringr::str_squish(stringr::str_to_title(target_audience)), # "Male/Female/Unisex"
    longevity = stringr::str_squish(stringr::str_to_title(longevity)) # "Strong/Medium/..."
  )
Let's take a look at the first 10 rows of the dataset and at its structure.
perfume[1:10,]
##     brand          perfume type         category target_audience longevity
## 1  dumont        nitro red  edp      Fresh Scent            Male    Strong
## 2  dumont nitro pour homme  edp      Fresh Scent            Male    Strong
## 3  dumont      nitro white  edp      Fresh Scent          Unisex    Strong
## 4  dumont       nitro blue  edp      Fresh Scent          Unisex    Strong
## 5  dumont      nitro green  edp      Fresh Scent          Unisex    Strong
## 6  dumont   nitro platinum  edp     Mass Pleaser            Male    Strong
## 7  dumont    nitro intense  edp      Woody Spicy            Male    Strong
## 8  dumont      nitro black  edp      Woody Spicy            Male    Strong
## 9  dumont     celerio oros  edp Oriental Vanilla          Unisex    Medium
## 10 dumont     celerio epic  edp   Woody Aromatic            Male    Medium
glimpse(perfume)
## Rows: 1,004
## Columns: 6
## $ brand <chr> "dumont", "dumont", "dumont", "dumont", "dumont", "dum…
## $ perfume <chr> "nitro red", "nitro pour homme", "nitro white", "nitro…
## $ type <chr> "edp", "edp", "edp", "edp", "edp", "edp", "edp", "edp"…
## $ category <chr> "Fresh Scent", "Fresh Scent", "Fresh Scent", "Fresh Sc…
## $ target_audience <chr> "Male", "Male", "Unisex", "Unisex", "Unisex", "Male", …
## $ longevity <chr> "Strong", "Strong", "Strong", "Strong", "Strong", "Str…
summary(perfume)
## brand perfume type category
## Length:1004 Length:1004 Length:1004 Length:1004
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## target_audience longevity
## Length:1004 Length:1004
## Class :character Class :character
## Mode :character Mode :character
brand – The company or label that produces the perfume (e.g., Dumont).
perfume – The name of the fragrance (e.g., Nitro Red).
type – Concentration or formulation of the perfume (e.g., EDP – Eau de Parfum).
category – Classification of the fragrance based on scent family or style (e.g., Fresh Scent, Woody Spicy, Oriental Vanilla).
target_audience – The intended wearer of the perfume (e.g., Male, Female, Unisex).
longevity – Expected performance in terms of duration on the skin (e.g., Strong, Medium).
Nitro Red (Dumont, EDP) – A fresh scent designed for men with strong longevity.
Celerio Oros (Dumont, EDP) – An oriental vanilla fragrance suitable for unisex wearers with medium longevity.
Nitro Black (Dumont, EDP) – A woody spicy perfume for men with strong performance.
Next, we count the number of unique values in each column.
# Count the unique values in each column
sapply(perfume, function(x) length(unique(x)))
## brand perfume type category target_audience
## 55 940 11 157 7
## longevity
## 13
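The counts above say how many distinct values each column holds, not what those values are. A minimal sketch of how the levels themselves could be listed, using a small made-up data frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```r
# Toy data frame standing in for the perfume data (values are made up).
toy <- data.frame(
  target_audience = c("Male", "Female", "Unisex", "Male", "Unisex", "Unisex"),
  longevity       = c("Strong", "Medium", "Strong", "Strong", "Light", "Medium"),
  stringsAsFactors = FALSE
)

# Number of distinct values per column, as in the call above.
n_unique <- sapply(toy, function(x) length(unique(x)))

# Frequency of each distinct value per column, most common first.
level_counts <- lapply(toy, function(x) sort(table(x), decreasing = TRUE))

print(n_unique)
print(level_counts$target_audience)
```

Listing the levels this way makes it easier to spot labels that differ only in spelling or accents before running any tests on them.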
Consumer perceptions of fragrance families are often shaped by cultural associations with gender. Testing whether target audience is independent of category and type lets us uncover systematic patterns of preference, highlighting which scent profiles are traditionally gendered and which bridge audiences.
gender_in_category <- perfume |>
  dplyr::count(category, target_audience, name = "n") |>
  dplyr::group_by(category) |>
  dplyr::mutate(share_within_category = n / sum(n)) |>
  dplyr::arrange(category, dplyr::desc(share_within_category)) |>
  dplyr::ungroup()
print(head(gender_in_category, 30))
## # A tibble: 30 × 4
## category target_audience n share_within_category
## <chr> <chr> <int> <dbl>
## 1 Amber Female 1 0.5
## 2 Amber Unisex 1 0.5
## 3 Amber Floral Unisex 31 0.775
## 4 Amber Floral Female 9 0.225
## 5 Amber Fougere Male 1 1
## 6 Amber Fougère Unisex 1 1
## 7 Amber Leather Unisex 1 1
## 8 Amber Musk Unisex 2 1
## 9 Amber Oriental Unisex 2 1
## 10 Amber Oud Unisex 2 1
## # ℹ 20 more rows
gender_in_type <- perfume |>
  dplyr::count(type, target_audience, name = "n") |>
  dplyr::group_by(type) |>
  dplyr::mutate(share_within_type = n / sum(n)) |>
  dplyr::arrange(type, dplyr::desc(share_within_type)) |>
  dplyr::ungroup()
print(gender_in_type)
## # A tibble: 18 × 4
## type target_audience n share_within_type
## <chr> <chr> <int> <dbl>
## 1 alcohol-free Unisex 1 1
## 2 attar Unisex 1 1
## 3 cologne Female 6 0.545
## 4 cologne Unisex 5 0.455
## 5 concentrate Unisex 2 1
## 6 edp Unisex 311 0.452
## 7 edp Female 224 0.326
## 8 edp Male 153 0.222
## 9 edt Female 69 0.527
## 10 edt Male 35 0.267
## 11 edt Unisex 27 0.206
## 12 extrait Unisex 4 1
## 13 extrait de parfum Unisex 16 0.941
## 14 extrait de parfum Female 1 0.0588
## 15 oil Unisex 3 1
## 16 parfum Male 20 0.541
## 17 parfum Female 12 0.324
## 18 parfum Unisex 5 0.135
# ====== A) Gender × Category ======
tab_cat <- table(perfume$target_audience, perfume$category)
# Chi-squared test of independence
chi_cat <- chisq.test(tab_cat)
print(chi_cat)
##
## Pearson's Chi-squared test
##
## data: tab_cat
## X-squared = 1137.3, df = 288, p-value < 2.2e-16
# Cramer's V
cramer_v_cat <- sqrt(chi_cat$statistic / (sum(tab_cat) * (min(dim(tab_cat)) - 1)))
cat("Cramer's V (Gender × Category):", cramer_v_cat, "\n")
## Cramer's V (Gender × Category): 0.7970959
# Convert the residual matrix to a long table
resid_cat <- as.data.frame(as.table(chi_cat$residuals))
colnames(resid_cat) <- c("Gender", "Category", "Residual")
# Top 20 residuals by absolute value
top20_resid_cat <- resid_cat %>%
  arrange(desc(abs(Residual))) %>%
  slice_head(n = 20)
# Visualization: residual bar chart
ggplot(top20_resid_cat, aes(x = reorder(paste(Category, Gender, sep = " - "), abs(Residual)),
                            y = Residual, fill = Residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
                    labels = c("FALSE" = "Under-represented", "TRUE" = "Over-represented")) +
  labs(title = "Top 20 Residuals: Gender × Category",
       x = "Category - Gender", y = "Pearson Residual", fill = "Interpretation") +
  theme_minimal(base_size = 13)
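The bars plotted above are Pearson residuals, (observed - expected) / sqrt(expected), where the expected count is what independence between the two variables would predict. A minimal sketch on a made-up 2 × 3 table (the counts and the name `toy_tab` are illustrative assumptions, not values from the dataset) showing that the manual calculation matches `chisq.test()$residuals`:

```r
# Made-up contingency table: rows are audiences, columns are categories.
toy_tab <- matrix(
  c(30, 10, 20,
    10, 30, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"),
                  category = c("Floral", "Woody", "Amber"))
)

res <- chisq.test(toy_tab)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(toy_tab), colSums(toy_tab)) / sum(toy_tab)

# Pearson residuals computed by hand.
manual_resid <- (toy_tab - expected) / sqrt(expected)

# Compare with the residuals returned by chisq.test().
print(all.equal(unname(manual_resid), unname(res$residuals)))
print(round(manual_resid, 2))
```

A positive residual marks a cell with more products than independence would predict, a negative one fewer; as a common rule of thumb, values beyond roughly ±2 are read as notable.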
Strong gender signals in key categories
Woody Spicy – Male and Woody Aromatic – Male are significantly over-represented, confirming that woody and spicy scent families are strongly aligned with the male market.
Floriental – Female, Oriental Floral – Female, and Floral Fruity – Female are also highly over-represented, reflecting the cultural association between floral/oriental scents and femininity.
Under-representation highlights boundaries
Woody Spicy – Female and Floriental – Unisex are strongly under-represented, suggesting that these categories rarely cross into female or unisex positioning.
Similarly, Oriental Floral – Unisex and Floriental – Male appear much less often than expected, reinforcing the persistence of gendered segmentation.
Unisex products occupy ambiguous ground
Some categories such as Unknown – Unisex and Amber Floral – Unisex show over-representation, indicating that when fragrances are not tied to traditional families, they may be marketed as unisex.
However, in strongly gendered families (floral for women, woody/spicy for men), unisex positioning is under-represented.
📌 Key Insights
Traditional gender associations remain strong: Floral families skew female, woody/spicy families skew male.
Unisex positioning works best in “neutral” or “blended” categories rather than in traditionally gendered ones.
Business implication: Brands aiming to expand unisex lines should focus on hybrid or less traditionally coded categories (e.g., amber, aromatic) rather than attempting to reframe strongly gendered ones.
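# ====== Helper function: Cramér's V ======
cramers_v <- function(tbl) {
  # Warnings about small expected counts are suppressed here
  chisq <- suppressWarnings(chisq.test(tbl))
  chi2 <- unname(chisq$statistic)
  n <- sum(tbl)
  n_rows <- nrow(tbl)
  n_cols <- ncol(tbl) # named n_cols rather than c to avoid masking base::c()
  V <- sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
  list(
    chisq_test = chisq,
    cramer_v = V
  )
}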
# ====== A) Type × Longevity ======
tab_type_long <- table(perfume$type, perfume$longevity)
res_type_long <- cramers_v(tab_type_long)
cat("\n== Q5: Type × Longevity ==\n")
##
## == Q5: Type × Longevity ==
print(res_type_long$chisq_test)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 770.07, df = 99, p-value < 2.2e-16
cat(sprintf("Cramer's V: %.3f\n", res_type_long$cramer_v))
## Cramer's V: 0.309
# Residual analysis
resid_type <- as.data.frame(as.table(res_type_long$chisq_test$residuals))
colnames(resid_type) <- c("type", "longevity", "residual")
# Top 20 residuals by absolute value
top20_type <- resid_type %>%
  arrange(desc(abs(residual))) %>%
  slice_head(n = 20)
ggplot(top20_type, aes(x = reorder(paste(type, longevity, sep = " - "), abs(residual)),
                       y = residual, fill = residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
                    labels = c("FALSE" = "Under-represented", "TRUE" = "Over-represented")) +
  labs(
    title = "Top 20 Residuals: Type × Longevity",
    x = "Type - Longevity",
    y = "Pearson Residual",
    fill = "Interpretation"
  ) +
  theme_minimal(base_size = 13)
# ====== B) Category × Longevity ======
tab_cat_long <- table(perfume$category, perfume$longevity)
res_cat_long <- cramers_v(tab_cat_long)
cat("\n== Q5: Category × Longevity ==\n")
##
## == Q5: Category × Longevity ==
print(res_cat_long$chisq_test)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 4818.4, df = 1584, p-value < 2.2e-16
cat(sprintf("Cramer's V: %.3f\n", res_cat_long$cramer_v))
## Cramer's V: 0.700
# Residual analysis (stored under a new name so the Gender × Category residuals are not overwritten)
resid_cat_long <- as.data.frame(as.table(res_cat_long$chisq_test$residuals))
colnames(resid_cat_long) <- c("category", "longevity", "residual")
# Top 20 residuals by absolute value
top20_cat <- resid_cat_long %>%
  arrange(desc(abs(residual))) %>%
  slice_head(n = 20)
ggplot(top20_cat, aes(x = reorder(paste(category, longevity, sep = " - "), abs(residual)),
                      y = residual, fill = residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
                    labels = c("FALSE" = "Under-represented", "TRUE" = "Over-represented")) +
  labs(
    title = "Top 20 Residuals: Category × Longevity",
    x = "Category - Longevity",
    y = "Pearson Residual",
    fill = "Interpretation"
  ) +
  theme_minimal(base_size = 13)
Our analysis focused on the relationship between fragrance type/category and longevity.
Using chi-square tests and residual analysis, we found a statistically significant association between the two, with very clear directional patterns.
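Cramér's V rescales the chi-square statistic to a 0 to 1 range: V = sqrt(X² / (n · min(r - 1, c - 1))), with n the table total and r, c the number of rows and columns. A minimal sketch on a made-up 3 × 3 table (`toy_tab` and its counts are illustrative assumptions) that computes V by hand, reports how many expected counts fall below 5, and shows the `simulate.p.value` argument of `chisq.test()` as one possible cross-check when a table is sparse. Whether that cross-check is needed for the tables in this report is an assumption, since their expected counts are not shown.

```r
# Made-up table with perfect association: each type maps to one longevity level.
toy_tab <- matrix(
  c(10,  0,  0,
     0, 10,  0,
     0,  0, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(type = c("edt", "edp", "extrait"),
                  longevity = c("Light", "Medium", "Strong"))
)

res <- suppressWarnings(chisq.test(toy_tab))

# Cramér's V = sqrt(X^2 / (n * min(r - 1, c - 1))).
n <- sum(toy_tab)
k <- min(nrow(toy_tab) - 1, ncol(toy_tab) - 1)
v <- sqrt(unname(res$statistic) / (n * k))

# Share of cells whose expected count is below 5.
share_sparse <- mean(res$expected < 5)

# Monte Carlo p-value, which does not rely on the large-sample approximation.
set.seed(1)
sim <- chisq.test(toy_tab, simulate.p.value = TRUE, B = 2000)

cat(sprintf("Cramer's V: %.3f\n", v))
cat(sprintf("Cells with expected count < 5: %.0f%%\n", 100 * share_sparse))
cat(sprintf("Simulated p-value: %.4f\n", sim$p.value))
```

Because every type in the toy table maps to exactly one longevity level, V reaches its maximum of 1; real tables fall somewhere between 0 and 1.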
Long-Lasting Fragrances (Very Strong / Strong)
Extrait de Parfum (high-concentration perfumes) is heavily over-represented in the Very Strong longevity group. This is consistent with product positioning: higher concentrations generally last longer on the skin.
Woody, Oud, and Rose categories are also strongly over-represented in the Strong group, suggesting that these scent families are typically associated with greater longevity.
Conclusion: High-concentration formats combined with deep woody/rose-based notes are the typical market choice for long-lasting perfumes.
Lighter Longevity (Light / Medium)
Eau de Toilette (EDT) is strongly over-represented in the Light group and severely under-represented in the Strong group.
Similarly, fresh and light floral categories tend to be under-represented in the Medium group, suggesting that they are positioned as shorter, lighter experiences.
Conclusion: Lighter concentrations and fresher scent profiles naturally lean toward shorter-lasting usage.
Under-Represented Segments
Many Floral and Oriental Floral fragrances are under-represented in the Medium group. This suggests a “polarized” pattern: they are either formulated as light, fleeting perfumes or pushed directly into strong, long-lasting territory.
Conclusion: Certain categories show a two-pole distribution, rarely occupying the middle ground.
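One way to check a two-pole pattern like this is to compare each category's longevity profile with the overall profile. A minimal sketch with made-up counts (the table `toy_tab` and its numbers are illustrative assumptions, not results from the dataset):

```r
# Made-up counts of products by category and longevity level.
toy_tab <- matrix(
  c(20,  5, 25,   # Floral: mostly Light or Strong
    10, 30, 10),  # Woody: mostly Medium
  nrow = 2, byrow = TRUE,
  dimnames = list(category = c("Floral", "Woody"),
                  longevity = c("Light", "Medium", "Strong"))
)

# Share of each longevity level within a category (each row sums to 1).
row_profile <- prop.table(toy_tab, margin = 1)

# Overall share of each longevity level across all categories.
overall <- colSums(toy_tab) / sum(toy_tab)

# Difference from the overall profile; negative means under-represented.
diff_from_overall <- sweep(row_profile, 2, overall)

print(round(row_profile, 2))
print(round(diff_from_overall, 2))
```

In this toy table, Floral sits well below the overall Medium share while exceeding it at both Light and Strong, which is the shape a polarized category would show.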
📊 Key Insights
Type and category are both clearly associated with longevity, and the findings are consistent with fragrance industry intuition:
Higher concentration + heavier notes → longer-lasting scents.
Lower concentration + fresher notes → lighter, shorter-lasting scents.
Business implications:
For markets demanding long-lasting performance, brands should prioritize Extrait de Parfum formats and emphasize Woody / Oud / Rose compositions.
For everyday, casual consumers, the focus should be on EDT / fresh scents.
This analysis bridges consumer expectations with product design decisions, helping brands position products more strategically.
This analysis of the perfume dataset provides a structured view of how the fragrance market is shaped by audience preferences, brand strategies, product categories, and technical attributes such as type and longevity. From Q1 through Q5, several key insights emerge:
Unisex fragrances are no longer niche (Q1). With over one-third of the market, unisex perfumes have surpassed both male- and female-targeted products, reflecting a broad cultural shift toward inclusivity and flexibility in personal expression.
A few brands dominate through large product portfolios (Q2). Jean Paul Gaultier, Paris Corner, and Armaf together account for nearly half of the combined share of the top 10 brands. Traditional luxury brands remain influential but compete more on brand equity than on sheer variety.
Woody, spicy, and floral–oriental blends define the mainstream market (Q3). Categories such as Woody Spicy and Floriental capture the largest shares, while Eau de Parfum (EDP) is the overwhelmingly dominant type. Niche categories like fresh or aquatic scents remain underrepresented, yet may offer opportunities for differentiation.
Gender preferences are statistically significant and structured (Q4). Chi-square tests confirm strong associations: floral and fruity categories are over-represented among female products, woody and spicy categories dominate male lines, while some blends (e.g., Amber Woody) successfully bridge into unisex markets. This highlights both the persistence of traditional preferences and areas of convergence.
Longevity is shaped by structural choices (Q5). Certain categories and types are systematically associated with stronger or longer-lasting scents, suggesting that product design choices directly influence consumer perception of durability and value.
📊 Strategic Takeaways
Invest in unisex product lines: Demand for gender-neutral fragrances has become mainstream.
Differentiate within dominant categories: The woody and floral–oriental spaces are crowded; innovation is required to stand out.
Balance portfolio strategy: Brands can win either through scale (broad product ranges) or through premium positioning with smaller but iconic collections.
Leverage longevity as a value driver: Positioning long-lasting perfumes within competitive categories may strengthen consumer trust and pricing power.